Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN)
architecture that has been designed to address the vanishing and exploding
gradient problems of conventional RNNs. Unlike feedforward neural networks,
RNNs have cyclic connections making them powerful for modeling sequences. They
have been successfully used for sequence labeling and sequence prediction
tasks, such as handwriting recognition, language modeling, and phonetic
labeling of acoustic frames. However, in contrast to deep neural networks,
the use of
RNNs in speech recognition has been limited to phone recognition in small scale
tasks. In this paper, we present novel LSTM based RNN architectures which make
more effective use of model parameters to train acoustic models for large
vocabulary speech recognition. We train and compare LSTM, RNN, and DNN models
across a range of parameter counts and configurations. We show that LSTM
models converge quickly and give state-of-the-art speech recognition
performance for relatively small-sized models.
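As a reminder of the mechanism behind the vanishing-gradient claim, here is a
minimal NumPy sketch of a single LSTM cell step; the parameter layout and the
names (W, U, b) are illustrative, not the paper's notation.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x, h_prev, c_prev, W, U, b):
        # W: (4H, D), U: (4H, H), b: (4H,) hold the stacked parameters
        # for the input, forget, cell, and output gates.
        H = h_prev.shape[0]
        z = W @ x + U @ h_prev + b
        i = sigmoid(z[0:H])           # input gate
        f = sigmoid(z[H:2*H])         # forget gate
        g = np.tanh(z[2*H:3*H])       # candidate cell update
        o = sigmoid(z[3*H:4*H])       # output gate
        c = f * c_prev + i * g        # additive cell-state update
        h = o * np.tanh(c)
        return h, c

The additive update of the cell state c is what lets gradients flow across
many time steps without vanishing, which conventional RNNs cannot guarantee.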
Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent
neural networks (RNNs) outperform feedforward deep neural networks (DNNs) as
acoustic models for speech recognition. More recently, we have shown that the
performance of sequence trained context dependent (CD) hidden Markov model
(HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained
phone models initialized with connectionist temporal classification (CTC). In
this paper, we present techniques that further improve performance of LSTM RNN
acoustic models for large vocabulary speech recognition. We show that frame
stacking and reduced frame rate lead to more accurate models and faster
decoding. CD phone modeling leads to further improvements. We also present
initial results for LSTM RNN models outputting words directly. (To be
published in the INTERSPEECH 2015 proceedings.)
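The abstract does not give the exact stacking parameters, so the values below
are placeholders; this NumPy sketch only illustrates the shape of the frame
stacking and frame-rate reduction idea.

    import numpy as np

    def stack_and_subsample(frames, stack=8, stride=3):
        # frames: (T, D) acoustic features. Concatenate `stack` consecutive
        # frames into one wide frame, then keep every `stride`-th result, so
        # the model sees more context per step and runs fewer steps.
        T, D = frames.shape
        stacked = [frames[t:t + stack].reshape(-1)
                   for t in range(0, T - stack + 1)]
        return np.asarray(stacked)[::stride]    # (~T/stride, stack*D)

Running the acoustic model over fewer, wider frames is what yields both the
accuracy gain and the faster decoding reported above.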
Federated Learning Of Out-Of-Vocabulary Words
We demonstrate that a character-level recurrent neural network is able to
learn out-of-vocabulary (OOV) words under federated learning settings, for the
purpose of expanding the vocabulary of a virtual keyboard for smartphones
without exporting sensitive text to servers. High-frequency words can be
sampled from the trained generative model by drawing from the joint posterior
directly. We study the feasibility of the approach in two settings: (1) using
simulated federated learning on a publicly available non-IID per-user dataset
from a popular social networking website, (2) using federated learning on data
hosted on user mobile devices. The model achieves good recall and precision
compared to ground-truth OOV words in setting (1). With (2) we demonstrate the
practicality of this approach by showing that we can learn meaningful OOV words
with good character-level prediction accuracy and cross-entropy loss.
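A sketch of the sampling step described above: words are drawn from the
trained character-level model one character at a time until an end-of-word
symbol appears. The model.next_char_probs interface and the "</w>" marker are
hypothetical stand-ins, not the paper's API.

    import numpy as np

    def sample_word(model, char_vocab, max_len=20):
        # Repeated draws concentrate on high-probability strings, which is
        # why frequent OOV words can be recovered from samples alone.
        chars = []
        for _ in range(max_len):
            probs = model.next_char_probs(chars)   # hypothetical interface
            c = np.random.choice(len(char_vocab), p=probs)
            if char_vocab[c] == "</w>":            # end-of-word marker
                break
            chars.append(char_vocab[c])
        return "".join(chars)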
Mobile Keyboard Input Decoding with Finite-State Transducers
We propose a finite-state transducer (FST) representation for the models used
to decode keyboard inputs on mobile devices. Drawing on lessons from the
field of speech recognition, we describe a decoding framework that can satisfy
the strict memory and latency constraints of keyboard input. We extend this
framework to support functionalities typically not present in speech
recognition, such as literal decoding, autocorrections, word completions, and
next word predictions.
We describe the general framework of what we call, for short, the keyboard
"FST decoder," as well as the implementation details that are new compared to a
speech FST decoder. We demonstrate that the FST decoder enables new UX features
such as post-corrections. Finally, we sketch how this decoder can support
advanced features such as personalization and contextualization.
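The paper's decoder is built on FSTs, but a toy stand-in conveys the shape of
the problem: a beam search over per-tap character likelihoods, constrained by
a plain word list rather than a lexicon transducer. The beam width and cost
format here are illustrative assumptions.

    import heapq

    def decode_taps(tap_costs, lexicon, beam_width=8):
        # tap_costs: one {char: -log prob} dict per key tap, modeling the
        # spatial uncertainty of each touch point.
        beam = [(0.0, "")]
        for costs in tap_costs:
            candidates = [(total + cost, prefix + ch)
                          for total, prefix in beam
                          for ch, cost in costs.items()]
            beam = heapq.nsmallest(beam_width, candidates)
        in_lexicon = [hyp for hyp in beam if hyp[1] in lexicon]
        # Fall back to the literal (unconstrained) hypothesis when no
        # dictionary word survives, mirroring literal decoding above.
        return min(in_lexicon) if in_lexicon else min(beam)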
An Investigation Into On-device Personalization of End-to-end Automatic Speech Recognition Models
Speaker-independent speech recognition systems trained with data from many
users are generally robust against speaker variability and work well for a
large population of speakers. However, these systems do not always generalize
well for users with very different speech characteristics. This issue can be
addressed by building personalized systems that are designed to work well for
each specific user. In this paper, we investigate the idea of securely training
personalized end-to-end speech recognition models on mobile devices so that
user data and models never leave the device and are never stored on a server.
We study how the mobile training environment impacts performance by simulating
on-device data consumption. We conduct experiments using data collected from
speech impaired users for personalization. Our results show that
personalization achieved a 63.7% relative word error rate reduction when
trained in a server environment and 58.1% in a mobile environment. Moving to
on-device personalization resulted in an 18.7% performance degradation, in
exchange for improved scalability and data privacy. To train the model on
device, we split the gradient computation into two parts and achieved a 45%
memory reduction at the expense of a 42% increase in training time.
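The abstract does not detail how the gradient computation is split;
activation checkpointing is one standard way to realize such a two-way split.
A PyTorch sketch with a made-up model shape:

    import torch
    from torch.utils.checkpoint import checkpoint_sequential

    model = torch.nn.Sequential(
        torch.nn.Linear(80, 512), torch.nn.ReLU(),
        torch.nn.Linear(512, 512), torch.nn.ReLU(),
        torch.nn.Linear(512, 128),
    )
    x = torch.randn(16, 80, requires_grad=True)
    # Two segments: only segment-boundary activations are kept during the
    # forward pass; each segment is recomputed on backward, trading extra
    # training time for lower peak memory, as in the trade-off above.
    loss = checkpoint_sequential(model, 2, x).sum()
    loss.backward()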
Federated Learning for Emoji Prediction in a Mobile Keyboard
We show that a word-level recurrent neural network can predict emoji from
text typed on a mobile keyboard. We demonstrate the usefulness of transfer
learning for predicting emoji by pretraining the model using a language
modeling task. We also propose mechanisms to trigger emoji and tune the
diversity of candidates. The model is trained using a distributed on-device
learning framework called federated learning. The federated model is shown to
achieve better performance than a server-trained model. This work demonstrates
the feasibility of using federated learning to train production-quality models
for natural language understanding tasks while keeping users' data on their
devices.
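A minimal sketch of the transfer-learning recipe, assuming a shared recurrent
encoder with swappable output heads; all layer sizes and names are
illustrative.

    import torch

    vocab_size, emb_dim, hidden, n_emoji = 10000, 128, 256, 100

    embed = torch.nn.Embedding(vocab_size, emb_dim)
    encoder = torch.nn.LSTM(emb_dim, hidden, batch_first=True)
    lm_head = torch.nn.Linear(hidden, vocab_size)   # pretraining head
    emoji_head = torch.nn.Linear(hidden, n_emoji)   # fine-tuning head

    def predict(tokens, head):
        # tokens: (batch, seq) int64 word ids; predict from final state.
        out, _ = encoder(embed(tokens))
        return head(out[:, -1])

    # Pretrain with predict(tokens, lm_head) on next-word prediction, then
    # keep embed/encoder and train predict(tokens, emoji_head) on emoji.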
Understanding Unintended Memorization in Federated Learning
Recent works have shown that generative sequence models (e.g., language
models) have a tendency to memorize rare or unique sequences in the training
data. Since useful models are often trained on sensitive data, to ensure the
privacy of the training data it is critical to identify and mitigate such
unintended memorization. Federated Learning (FL) has emerged as a novel
framework for large-scale distributed learning tasks. However, it differs in
many aspects from the well-studied central learning setting where all the data
is stored at the central server. In this paper, we initiate a formal study to
understand the effect of different components of canonical FL on unintended
memorization in trained models, comparing with the central learning setting.
Our results show that several of FL's distinguishing components play an
important role in reducing unintended memorization. Specifically, we observe
that the clustering of data according to users, which happens by design in
FL, has a significant effect in reducing such memorization, and using the
method of
Federated Averaging for training causes a further reduction. We also show that
training with a strong user-level differential privacy guarantee results in
models that exhibit the least amount of unintended memorization.
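Unintended memorization in this line of work is commonly quantified with a
canary-style exposure metric (in the spirit of the "secret sharer"
methodology); a minimal sketch, assuming the negative log-likelihoods come
from the trained model:

    import math

    def exposure(canary_nll, reference_nlls):
        # Rank the inserted canary among random candidate sequences; high
        # exposure means the model assigns the canary unusually high
        # probability, i.e. it was memorized.
        rank = 1 + sum(nll < canary_nll for nll in reference_nlls)
        return math.log2(len(reference_nlls) + 1) - math.log2(rank)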
Applied Federated Learning: Improving Google Keyboard Query Suggestions
Federated learning is a distributed form of machine learning where both the
training data and model training are decentralized. In this paper, we use
federated learning in a commercial, global-scale setting to train, evaluate and
deploy a model to improve virtual keyboard search suggestion quality without
direct access to the underlying user data. We describe our observations in
federated training, compare metrics to live deployments, and present resulting
quality increases. Overall, we demonstrate how federated learning can be
applied end-to-end both to improve user experiences and to enhance user
privacy.
Federated Learning for Mobile Keyboard Prediction
We train a recurrent neural network language model using a distributed,
on-device learning framework called federated learning for the purpose of
next-word prediction in a virtual keyboard for smartphones. Server-based
training using stochastic gradient descent is compared with training on client
devices using the Federated Averaging algorithm. The federated algorithm, which
enables training on a higher-quality dataset for this use case, is shown to
achieve better prediction recall. This work demonstrates the feasibility and
benefit of training language models on client devices without exporting
sensitive user data to servers. The federated learning environment gives users
greater control over the use of their data and simplifies the task of
incorporating privacy by default with distributed training and aggregation
across a population of client devices.
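A minimal sketch of one Federated Averaging round, assuming model weights are
flat NumPy arrays and local_update is a client-side SGD routine (both are
illustrative assumptions):

    import numpy as np

    def federated_averaging_round(global_w, client_datasets, local_update):
        # Each client starts from the current global model and trains on
        # its own on-device data; the server then averages the returned
        # weights, weighted by the number of local examples. Raw data
        # never leaves the device, only model updates do.
        new_weights, sizes = [], []
        for data in client_datasets:
            new_weights.append(local_update(global_w.copy(), data))
            sizes.append(len(data))
        total = float(sum(sizes))
        return sum((n / total) * w for n, w in zip(sizes, new_weights))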
On-Device Personalization of Automatic Speech Recognition Models for Disordered Speech
While current state-of-the-art Automatic Speech Recognition (ASR) systems
achieve high accuracy on typical speech, they suffer from significant
performance degradation on disordered speech and other atypical speech
patterns. Personalization of ASR models, a commonly applied solution to this
problem, is usually performed in a server-based training environment, which
raises concerns around data privacy, delays model updates, and incurs
communication costs for copying data and models between mobile devices and
server infrastructure. In this paper, we present an approach to on-device ASR
personalization with very small amounts of speaker-specific data. We test our
approach on a diverse set of 100 speakers with disordered speech and find a
median relative word error rate improvement of 71% with only 50 short
utterances required per speaker. When tested on a voice-controlled home
automation platform, on-device personalized models show a median task success
rate of 81%, compared to only 40% for the unadapted models.